MoodyDuck

A DuckDB Interface to Orbis

Richard Winter

2026-05-28

Part I

Why MoodyDuck?

The Problem With Orbis Flatfiles

Orbis delivers data as large TSV files

  • Dozens of files, each potentially millions of rows
  • Too large to use in memory directly
  • Painful to reload selections for every project
  • Local copies diverge across researchers
  • Go through each table manually to subset and merge

Typical workflow (before)

1. Receive BvD flatfiles
2. Try to load TSV into Stata / R
3. Wait
4. Run out of memory
5. Subset the TSV manually
6. Reload
7. Repeat for every researcher and project

Note

But why not just use the Orbis User Interface?

  1. Limited connections available
  2. Subject to internet speed
  3. Reproducibility: Flatfiles are clearly versioned while Orbis database itself is subject to ongoing updates and changes

The Solution: DuckDB-powered Database Format

MoodyDuck packages all Orbis flatfiles into a single
high-performance database file.

📁 One file: orbis_202506.duckdb
on GBP, Area & Cobra Workstations

⚡ Fast and memory efficient: query data without loading it into memory

🔗 No server: Runs inside your R or Python session directly

What Is DuckDB?

An analytical processing database built for data science

  • Runs in-process: no server to start or manage
  • Reads only required columns, processes data in bulk
  • Fast aggregations and joins
  • Handles datasets far larger than RAM via spill-to-disk
  • Full SQL, plus dplyr (R) and pandas/polars (Python)
  • Multiple simultaneous read-only connections –> no conflicts

Part II

What’s in MoodyDuck?

48+ Tables Across 6 Domains

💰 Financial Statements financials_industry · _eur · _usd
financials_banks · financials_insurances
key_financials · _eur · _usd
12 tables

🏢 Firm Demographics entities · identifiers · names
legal_info · status_history
contact_info · addresses· overviews
8 tables

🔗 Ownership Links links_current
links_2007 through links_2024
19 tables

🏭 Industry industry_classifications
NACE Rev. 2 · SIC · NAICS
1 table

👥 Advisors auditors_current
bankers_current/previous
dmc_current/previous
5 tables

📈 Market stock_indexes
1 table

The Key Identifier: bvd_id_number

Every table uses bvd_id_number as the firm identifier.

The first two characters encode the country iso code of the company:

Prefix Country
DE Germany
FR France
GB United Kingdom
US United States
NL Netherlands

Country filtering is therefore straightforward.

-- Filter to German firms
WHERE SUBSTR(bvd_id_number, 1, 2) = 'DE'

-- Filter to multiple countries
WHERE SUBSTR(bvd_id_number, 1, 2)
      IN ('DE', 'FR', 'AT', 'CH')

⚠️ The prefix usually reflects the country of incorporation. Cross-check with legal_info.country_iso_code for critical analyses.

Part III

How can you access MoodyDuck?

Workstation Overview

GBP Workstation

Remote Desktop access

D:\OrbisDB\Data\duckdb\

Area Workstation

Remote Desktop access

Q:\Users\Richard Winter\ OrbisDB\Data\duckdb\

Cobra Workstation

On-site · Room SO 104

E:\MoodyduckDB\Data\duckdb\

Remote Desktop Access: GBP & Area

  1. Open Remote Desktop Connection (Win + Rmstsc)
  2. Enter the server address (contact the database administrator)
  3. Log in with your Workstation Credentials → contact WS administrator if you don’t have them

GBP Workstation

D:\OrbisDB\Data\duckdb\orbis_202506.duckdb

Area Workstation

Q:\Users\Richard Winter\OrbisDB\Data\duckdb\orbis_202506.duckdb

WS Admin: pascal.schrader@uni-mannheim.de

Cobra Workstation: On-Site Access

Room SO 104: Chair of Taxation, Accounting, and Finance, Center for Corporate Behavior and Regulation Analysis (COBRA)

  • One person at a time
  • Log in locally with special Orbis user credentials
  • Extracted data can be taken away on a USB drive
E:\MoodyduckDB\Data\duckdb\orbis_202512.duckdb

⚠️ Cobra runs a different vintage (orbis_202512) than GBP and Area (orbis_202506). Do not mix vintages across scripts in the same project.

WS Admin: matti.boie-wegener@uni-mannheim.de

Part IV

Getting Started

Installation

R

install.packages("duckdb")
install.packages("tidyverse")

# Recommended extras
install.packages("arrow")    # Parquet I/O

Python

pip install duckdb pandas

# Recommended extras
pip install polars      # faster than pandas
pip install pyarrow     # Parquet I/O

Connecting to the Database

R

library(duckdb)
library(tidyverse)

con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "D:/OrbisBackup/orbis_202506.duckdb",
  read_only = TRUE # ← always
)

Python

import duckdb

con = duckdb.connect(
    database  = "D:/OrbisBackup/orbis_202506.duckdb",
    read_only = True   # ← always
)

⚠️ Always use read_only = TRUE.
One write-mode connection locks the file for all users on the server.

Exploring What’s There

R

# List all tables
DBI::dbListTables(con)

# Column names and types
DBI::dbGetQuery(con, "DESCRIBE financials_industry_eur")

# Peek at first rows
tbl(con, "financials_industry_eur") |>
  glimpse()

Python

# List all tables
con.execute("SHOW TABLES").df()

# Column names and types
con.execute(
  "DESCRIBE financials_industry_eur"
).df()

# Preview first rows
con.execute("""
  SELECT * FROM financials_industry_eur
  LIMIT 5
""").df()

Part V

Querying the Database

The Most Important Concept: Lazy Evaluation

Queries are not executed until you ask for the result.

R

# Nothing loaded yet ↓
query <- tbl(con, "financials_industry_eur") |>
  filter(substr(bvd_id_number, 1, 2) == "DE") |>
  select(bvd_id_number, financial_year, total_assets)

# Execution happens here ↓
result <- collect(query)

Use show_query(query) to see the generated SQL.

Python

# Nothing loaded yet ↓
query = con.sql("""
  SELECT bvd_id_number, financial_year,
         total_assets
  FROM financials_industry_eur
  WHERE SUBSTR(bvd_id_number, 1, 2) = 'DE'
""")

# Execution happens here ↓
result = query.df()    # pandas
result = query.pl()    # polars

Simple Query Example

R (dplyr style)

tbl(con, "financials_industry_eur") |>
  filter(
    substr(bvd_id_number, 1, 2) == "DE",
    financial_year == 2022,
    consolidation_code %in% c("U1", "U2")
  ) |>
  select(bvd_id_number, financial_year, total_assets, number_of_employees) |>
  head(10) |>
  collect()

Python (SQL)

con.execute("""
  SELECT bvd_id_number, financial_year,
         total_assets, number_of_employees
  FROM   financials_industry_eur
  WHERE  SUBSTR(bvd_id_number, 1, 2) = 'DE'
    AND  financial_year = 2022
    AND  consolidation_code IN ('U1', 'U2')
  LIMIT  10
""").df()

💡 Financial files are not firm-year panels. A firm may have multiple entries across different years (e.g., due to different consolidation levels, filing types, etc.).

A More Involved Example: Joining Multiple Tables

R

fin <- tbl(con, "financials_industry_eur") |>
  filter(
    substr(bvd_id_number, 1, 2) == "DE",
    financial_year %in% 2020:2022,
    consolidation_code %in% c("U1", "U2")
  ) |>
  distinct(bvd_id_number, financial_year, .keep_all = TRUE)

ind <- tbl(con, "industry_classifications") |>
  filter(!is.na(nace_rev_2_main_section)) |>
  select(bvd_id_number, nace_rev_2_main_section)

leg <- tbl(con, "legal_info") |>
  select(bvd_id_number, status, standardised_legal_form)

df <- fin |>
  left_join(ind, by = "bvd_id_number") |>
  left_join(leg, by = "bvd_id_number") |>
  collect()

Python

query = """
  WITH fin AS (
    SELECT DISTINCT ON (bvd_id_number,
                        financial_year) *
    FROM financials_industry_eur
    WHERE SUBSTR(bvd_id_number,1,2) = 'DE'
      AND financial_year BETWEEN 2020 AND 2022
      AND consolidation_code IN ('U1','U2')
  ),
  ind AS (
    SELECT 
      bvd_id_number, nace_rev_2_main_section
    FROM industry_classifications
    WHERE nace_rev_2_main_section IS NOT NULL
  ),
  leg AS (
    SELECT bvd_id_number, status,
           standardised_legal_form
    FROM legal_info
  )
  SELECT f.*, i.nace_rev_2_main_section,
         l.status, l.standardised_legal_form
  FROM fin f
  LEFT JOIN ind i USING (bvd_id_number)
  LEFT JOIN leg l USING (bvd_id_number)
"""
df = con.execute(query).df()

Aggregations: Let DuckDB Do the Work

Aggregate inside DuckDB — don’t collect first and then summarise in R/Python.

R

tbl(con, "financials_industry_eur") |>
  filter(
    substr(bvd_id_number, 1, 2) == "DE",
    financial_year %in% 2020:2022,
    consolidation_code %in% c("U1", "U2")
  ) |>
  group_by(financial_year) |>
  summarise(
    n_firms = n(),
    med_assets = median(total_assets, na.rm = TRUE),
    mean_emp = mean(number_of_employees, na.rm = TRUE)
  ) |>
  collect()

Python

con.execute("""
  SELECT
    financial_year,
    COUNT(*)                 AS n_firms,
    MEDIAN(total_assets)     AS med_assets,
    AVG(number_of_employees) AS mean_emp
  FROM financials_industry_eur
  WHERE SUBSTR(bvd_id_number, 1, 2) = 'DE'
    AND financial_year BETWEEN 2020 AND 2022
    AND consolidation_code IN ('U1', 'U2')
  GROUP BY financial_year
""").df()

Saving Your Results

R

# R-native (fast, preserves types)
saveRDS(df, "data/sample.rds")

# Parquet — recommended for sharing
arrow::write_parquet(df, "data/sample.parquet")

# Export to Stata
haven::write_dta(df, "data/sample.dta")

# Write Parquet directly from DuckDB
DBI::dbExecute(
  con,
  "
  COPY (SELECT ...) TO 'data/de.parquet'
  (FORMAT PARQUET)
"
)

Python

# Parquet — recommended
df.to_parquet("data/sample.parquet",
              index=False)

# CSV
df.to_csv("data/sample.csv", index=False)

# Export to Stata
import pyreadstat
pyreadstat.write_dta(df, "data/sample.dta")

# Write Parquet directly from DuckDB
con.execute("""
  COPY (SELECT ...) TO 'data/de.parquet'
  (FORMAT PARQUET)
""")

💡 Prefer Parquet. Smaller than CSV, faster to read, preserves types. Readable by R, Python, Stata 18+, and DuckDB.

Part VI

Shared Server Resource Management: Don’t make everybody angry with you

The Shared Server Problem

MoodyDuck runs on shared Windows Server workstations.
By default, DuckDB claims:

  • RAM: up to 80% of system memory
  • CPU: all availablelogical cores

Since you’re not the only one using these resources this may lead to:

  • Out-of-memory errors for everyone
  • Machine becoming unresponsive
  • Crashes that affect all users

How to avoid these issues: set limits at session start

# R
DBI::dbExecute(con,
  "SET memory_limit = '20GB'")
DBI::dbExecute(con,
  "SET threads = 4")
# Python
con.execute("SET memory_limit = '20GB'")
con.execute("SET threads = 4")

⚠️ memory_limit controls DuckDB’s internal buffer only — not the memory used when R/Python stores the collected result. Only collect the rows and columns you actually need.

Spill-to-Disk and Temp Files

When DuckDB exceeds memory_limit, it spills intermediate results to a temp directory.

Default: C:\Windows\Temp — often slow, shared, limited space.

Redirect to a fast dedicated location:

# R
DBI::dbExecute(con,
  "SET temp_directory = 'D:/DuckDBTemp'")
# Python
con.execute(
  "SET temp_directory = 'D:/DuckDBTemp'")

If queries are spilling often…

Don’t just raise memory_limit.
Instead:

  • Filter earlier
  • Select fewer columns
  • Split by year or country
  • Aggregate inside DuckDB first

Always Close Your Connection

R

con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "D:/OrbisBackup/orbis_202506.duckdb",
  read_only = TRUE
)

tryCatch(
  {
    DBI::dbExecute(con, "SET memory_limit = '20GB'")
    DBI::dbExecute(con, "SET threads = 4")

    # ... your analysis ...
  },
  finally = {
    DBI::dbDisconnect(con, shutdown = TRUE)
    # ↑ runs even if an error occurs
  }
)

Python

# Context manager closes automatically
with duckdb.connect(
    "D:/OrbisBackup/orbis_202506.duckdb",
    read_only=True
) as con:
    con.execute("SET memory_limit = '20GB'")
    con.execute("SET threads = 4")

    # ... your analysis ...

# Connection closed here, even on error

⚠️ Multiple read-only connections are fine. One write-mode connection locks the file for everyone.

Copy-Paste Boilerplate

R

library(duckdb)
library(DBI)
library(tidyverse)

con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "D:/OrbisBackup/orbis_202506.duckdb",
  read_only = TRUE
)

DBI::dbExecute(con, "SET memory_limit   = '20GB'") # ← adjust
DBI::dbExecute(con, "SET threads        = 4") # ← adjust
DBI::dbExecute(con, "SET temp_directory = 'D:/DuckDBTemp'")

# --- your analysis here ---

DBI::dbDisconnect(con, shutdown = TRUE)

Python

import duckdb
import pandas as pd

con = duckdb.connect(
    database  = "D:/OrbisBackup/orbis_202506.duckdb",
    read_only = True
)

con.execute("SET memory_limit   = '20GB'") # ← adjust
con.execute("SET threads        = 4")       # ← adjust
con.execute("SET temp_directory = 'D:/DuckDBTemp'")

# --- your analysis here ---

con.close()

Interactive Mode & Documentation

On Area and GBP workstations, connect to the MoodyDuck Hub via localhost:4239 in your web browser.

  • Access to slides and documentation.

  • Access to interactive Graphical User Interface to Query & Explore Company Info, Financials, Industry, and Ownership Links.

Note

The MoodyDuck Hub and the GUI App are still under development and may be unavailable at times when new updates are being rolled out. Best used for small queries and quick lookups, not for heavy data extraction.

Get in Touch

Feel free to reach out if you want to contribute to this project or for questions on any of the following:

📦   Orbis & MoodyDuck: data coverage, table structure, access, updates

🤖   AI tools: Claude Code, LLM workflows


✉️   richard.winter@uni-mannheim.de